Chapter 5 Exercises

First, let's load the appropriate packages.

library(printr)
library(fpp2)
library(tidyverse)
library(ggthemes)
library(jtools)

theme_set(theme_clean())

5.1: Daily Electricity Demand

a) Data exploration

# First we load the data from fpp2 and save it as a data frame with a Date column for nicer plotting
data("elecdaily")
elec.df <- as.data.frame(elecdaily)
elec.df$date <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-31"), by = 1)

elec.df %>% 
  ggplot(aes(x = date, y = Demand)) + 
  geom_line(color = 'red') +
  labs(title = 'Victoria, Australia Electricity usage during 2014',
       x = 'Date',
       y = 'Demand')
Total electricity demand in GW for Victoria, Australia, for each day of 2014

elec.df %>% 
ggplot(aes(x = date, y = Temperature)) + geom_line(color = 'blue') +
  labs(title = 'Victoria, Australia Temperature during 2014',
       x = 'Date',
       y = 'Temperature')
Daily temperatures for Melbourne in degrees Celsius

By the way, the temperature plot is accurate and really does start in January: Melbourne hit its maximum temperatures on January 16-17, 2014. At first I thought the temperature series started in some other month, but it turns out it does begin in January. Let's go on to check the relationship between daily electricity usage and temperature.

# Let's use the tslm() function (from the forecast package, loaded by fpp2)
lm1 <- tslm(Demand ~ Temperature, data = elecdaily)
summ1 <- summ(lm1, model.info=T)
export_summs(summ1, output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Electricity Demand'))
                Electricity Demand
(Intercept)    212.39 ***   (5.01)
Temperature      0.42       (0.23)
N              365
R squared        0.01
P value          0.07
AIC           3433.49
*** p < 0.001; ** p < 0.01; * p < 0.05.

We can see a positive relationship between temperature and electricity demand, although over the full year it is not significant at the 5% level (p = 0.07). The direction makes sense: as temperatures rise, people cool themselves down with fans and air conditioning, so hotter days mean more electricity use. (Low temperatures also raise demand through heating, which is part of why a single straight line fits the full year poorly.) For reference, the average temperature in Melbourne in 2014 was 21.26 degrees Celsius.

The question asks us to use the first 20 days, so from here on we will use only the first 20 days of data. The regression results are given below.

daily20 <- head(elecdaily, 20)
daily.df <- as.data.frame(daily20)
daily.df$date <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-01-20"), by = 1)

lm20 <- tslm(Demand ~ Temperature, data = daily20)
summ20 <- summ(lm20, model.info=T)
export_summs(summ20, output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Electricity Demand'))
                Electricity Demand
(Intercept)     39.21 *    (17.99)
Temperature      6.76 ***   (0.61)
N               20
R squared        0.87
P value          0.00
AIC            184.30
*** p < 0.001; ** p < 0.01; * p < 0.05.

We get the same positive relationship, and in the 20-day sample it is strongly significant (p < 0.001).

b) Checking our simple regression’s assumptions

We can check our assumptions with checkresiduals() from the fpp2 package.

par(mfrow = c(2,2))
checkresiduals(lm20)

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774

We can see that for the model fit to the first 20 days there are no outliers, and we find no evidence of autocorrelation. A fit of the regression is plotted below. Note that this would not be the case if we used the entire year of daily data.

daily.df %>% 
  ggplot(aes(x = Temperature, y = Demand)) + 
  geom_point() + 
  geom_smooth(method = 'lm', se = FALSE)

I would say this model is adequate for forecasting Demand or explaining the relationship between Temperature and Demand.

c, d) Simple forecasts with prediction interval

forecast <- forecast(lm20, 
                     newdata = data.frame(Temperature = c(15,35)))

knitr::kable(forecast) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F, 
                            position = 'float_right')
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
4.285714 140.5701 108.6810 172.4591 90.21166 190.9285
4.428571 275.7146 245.2278 306.2014 227.57056 323.8586

The first row is the forecast for 15 degrees Celsius and the second for 35 degrees Celsius. I would say that the prediction for 35 degrees Celsius is fairly accurate because there is a recorded temperature within the first 20 days that is close to it (we have a data point at 34 degrees Celsius). The forecast for 15 degrees is probably less reliable, since the closest observed temperature is 19.6 degrees Celsius.
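As a quick sanity check, the point forecasts can be reproduced by hand from the coefficients reported above (values copied, already rounded, from the 20-day summary table, so the results match the forecast table only up to rounding):

```r
# Point forecast = intercept + slope * temperature, using the rounded
# coefficients from the 20-day regression table above
b0 <- 39.21
b1 <- 6.76
b0 + b1 * c(15, 35)  # 140.61 and 275.81, matching the table up to rounding
```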

e) Entire dataset

The plot of the full dataset can be found under part a) Data exploration, and as that plot and discussion suggest, the relationship is not as strong or as useful for predicting electricity usage over the entirety of the year.

5.2: Olympic Winning Times

a) Data exploration

data("mens400")
autoplot(mens400) +
  labs(title = 'Winning Olympic Times over the Years',
       x = 'Year',
       y = 'Winning Time')
Winning times (in seconds) for the men’s 400 meters final in each Olympic Games

Two things stand out from the graph of the winning times. First, there are some missing values, because the Olympic Games did not take place during WW1 or WW2. Second, the winning time has been decreasing over the years, meaning the athletes are getting faster; however, this trend has been leveling off since the late 1990s.

b) Simple regression

# To calculate the average rate of decrease, we will make a dataframe object
mens.df <- as.data.frame(mens400)

# I just wanted to have a proper dependent variable name
mens.df$win <- mens.df$x

# We could use a years variable, but I think it's better to use a trend variable for interpretability of the regression results
mens.df$t <- seq(0, (2016-1896), by = 4)

lm_mens <- tslm(win ~ t, data = mens.df)
summ_mens <- summ(lm_mens, model.info=T)
export_summs(summ_mens, output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Win Time'),
             coefs = c("Trend" = 't'))
                Win Time
Trend           -0.06 ***  (0.01)
N               31
R squared        0.82
P value          0.00
AIC
*** p < 0.001; ** p < 0.01; * p < 0.05.
# we can use the regular year variable for the actual plotting
autoplot(mens400) +
  labs(title = 'Winning Olympic Times over the Years',
       x = 'Year',
       y = 'Winning Time') + 
  geom_smooth(method = 'lm', se = F)

We can see that, on average, the winning time has decreased by about 0.06 seconds per year, i.e. roughly a quarter of a second per Olympiad.

c) Residuals of our simple regression

checkresiduals(lm_mens)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295

Besides a couple of outliers, the residuals show no evidence of autocorrelation (Breusch-Godfrey p = 0.73), so the simple regression generally fits the data well.

d) Simple forecast

forecast <- forecast(lm_mens, 
                     newdata = data.frame(t = 124))

knitr::kable(forecast) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F, 
                            position = 'float_right')
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020 42.04231 40.44975 43.63487 39.55286 44.53176

The point prediction for the winning time of the men’s 400 m at the 2020 Olympics is about 42 seconds, with a 95% prediction interval from roughly 39.6 to 44.5 seconds. This forecast, using a linear model, assumes the residuals are normally distributed, which isn’t quite the case, as can be seen from [c) Residuals of our simple regression]

5.3 easter()

ausbeer returns total quarterly beer production in Australia (in megalitres) from 1956:Q1 to 2010:Q2.

easter() returns a vector of 0s and 1s, with fractional values when the Easter holiday spans both March and April in the observed time period. Easter is defined as the days from Good Friday to Easter Sunday inclusive, plus optionally Easter Monday if easter.mon = TRUE.

5.4 Elasticity

Taking the derivative of both sides of the log-log model with respect to \(x\):

\[ 1: \ln(y) = \beta_0 + \beta_1\ln(x) + \epsilon \\ 2: \frac{1}{y}\frac{dy}{dx} = \beta_1\cdot\frac{1}{x} \\ 3: \beta_1 = \frac{dy}{dx}\cdot\frac{x}{y} \]

So \(\beta_1\) is the ratio of the proportional change in \(y\) to the proportional change in \(x\), i.e. the elasticity of \(y\) with respect to \(x\).

5.5 Shop Sales

a) Data exploration

autoplot(fancy) + 
  labs(title = 'Sales over time',
       y = 'Sales',
       x = 'Year')

We can see from the above graph that sales are increasing over time and are highly seasonal, as expected from the problem description: sales spike each December around Christmas, with an additional bump each March (from 1988 on) due to the surfing festival.

b) Log transform

The seasonal variation in our data grows with the level of the series, stretching the range upwards; here is a histogram of the sales.

hist(fancy)

A logarithm transform makes the data more symmetric (closer to normal), and therefore more suitable for a linear model; the transformed histogram is shown below.

hist(log(fancy))
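The stabilizing effect of the log is easy to demonstrate on a toy series with multiplicative seasonality (simulated data, not fancy itself): the seasonal swings grow with the level in the raw units but are constant on the log scale.

```r
# Toy series: exponential trend times a two-period seasonal factor
t <- 1:120
x <- exp(0.02 * t) * rep(c(0.8, 1.2), 60)

c(x[2] - x[1], x[120] - x[119])                      # raw seasonal swing grows with the level
c(log(x[2]) - log(x[1]), log(x[120]) - log(x[119]))  # identical on the log scale
```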

c) Log regression

Below are the results of regressing the log-transformed fancy series on a trend, seasonal dummies, and the festival dummy.

fancy.ln <- log(fancy)

# Surfing festival dummy
fest <- rep(0, length(fancy))
fest[seq_along(fest)%%12 == 3] <- 1
# The festival started in 1988, our data goes back to 1987
fest[3] <- 0 
# Save it as a timeseries
fest <- ts(fest, start = c(1987,1), freq = 12)

fancy.df <- data.frame(fancy.ln, fest)
lm.ln <- tslm(fancy.ln ~ trend + season + fest, data = fancy.df)
export_summs(summ(lm.ln), output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Log Fancy Sales'))
                Log Fancy Sales
(Intercept)     7.62 ***  (0.07)
trend           0.02 ***  (0.00)
season2         0.25 *    (0.10)
season3         0.27      (0.19)
season4         0.38 ***  (0.10)
season5         0.41 ***  (0.10)
season6         0.45 ***  (0.10)
season7         0.61 ***  (0.10)
season8         0.59 ***  (0.10)
season9         0.67 ***  (0.10)
season10        0.75 ***  (0.10)
season11        1.21 ***  (0.10)
season12        1.96 ***  (0.10)
fest            0.50 *    (0.20)
N                 84
R squared       0.96
P value         0.00
AIC           -35.96
*** p < 0.001; ** p < 0.01; * p < 0.05.

d) Log model residuals

plot(lm.ln$residuals,
     main = 'Residuals of Log Model for fancy sales',
     ylab = 'Residuals',
     type = 'p')

I don’t see any obvious pattern in the residuals plotted against time.

data.frame(resid = lm.ln$residuals, fit = lm.ln$fitted.values) %>% 
  ggplot(aes(x=fit, y=resid)) + geom_point() + 
  labs(title = 'Fitted Values vs Residuals for Log Fancy Regression',
       y = 'Residuals',
       x = 'Fitted Values')

There is nothing too out of the ordinary with the fitted values either.

e) Box plots

boxplot(res ~ month, data = data.frame(res = lm.ln$residuals, month = rep(1:12, 7)),
        main = 'Boxplot of Residuals by Month')

From the boxplots we can see that months 8 (August), 9 (September), and 10 (October) have high variance, as do months 5 (May) and 3 (March).

f) Coefficients

Because the dependent variable is log transformed, the coefficients are effects on log sales; exponentiating a coefficient gives the multiplicative effect on sales itself. Otherwise we can read off the sign and magnitude: the season (month) dummies tell us how sales in each month compare to January, and all of them are positive, which matches the within-year pattern in the plots, while the increasing trend coefficient matches the overall growth. The trend and festival coefficients are also important for forecasting and for describing the pattern of sales.
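For a concrete reading, exponentiating a coefficient converts it into a multiplicative effect on (unlogged) sales. A small sketch using the festival coefficient from the table above:

```r
# A coefficient b on a dummy in a log-linear model implies sales are
# multiplied by exp(b) when the dummy switches on
b_fest <- 0.50                      # festival coefficient from the table above
pct_effect <- (exp(b_fest) - 1) * 100
round(pct_effect)                   # roughly a 65% increase in sales
```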

g) Breusch-Godfrey test

lmtest::dwtest(lm.ln, alt="two.sided")
## 
##  Durbin-Watson test
## 
## data:  lm.ln
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0

The Durbin-Watson statistic (used here in place of the Breusch-Godfrey test) is significant, indicating that the residuals are still autocorrelated, which violates the linear model assumptions that are required for valid inference.
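For intuition, the Durbin-Watson statistic is just the sum of squared successive residual differences over the sum of squared residuals; values near 2 indicate no first-order autocorrelation, values well below 2 indicate positive autocorrelation. A sketch on simulated residuals (not the model above):

```r
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(7)
e_white <- rnorm(100)                                      # uncorrelated residuals
e_ar    <- as.numeric(stats::filter(rnorm(100), 0.8,
                                    method = "recursive")) # AR(1), phi = 0.8
c(dw(e_white), dw(e_ar))  # near 2 for white noise, well below 2 for AR(1)
```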

h) Predictions

# New festival dates for the 36-month forecast horizon
new_fes <- rep(0, 36)
new_fes[seq_along(new_fes) %% 12 == 3] <- 1

forecast <- forecast(lm.ln, newdata=data.frame(fest = new_fes))

knitr::kable(forecast) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
Mar 1994 10.302990 10.048860 10.557120 9.911228 10.69475
Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
Jul 1994 10.233926 9.981096 10.486756 9.844168 10.62368
Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
Mar 1995 10.567228 10.310517 10.823938 10.171489 10.96297
Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
Jan 1996 10.019828 9.760563 10.279093 9.620151 10.41951
Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
Mar 1996 10.831466 10.571566 11.091365 10.430810 11.23212
Apr 1996 10.469941 10.210676 10.729206 10.070264 10.86962
May 1996 10.517395 10.258130 10.776659 10.117717 10.91707
Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
Aug 1996 10.761931 10.502667 11.021196 10.362255 11.16161
Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
Nov 1996 11.446775 11.187510 11.706039 11.047097 11.84645
Dec 1996 12.224288 11.965023 12.483553 11.824611 12.62396

i) Transformed predictions

forecast$mean <- exp(forecast$mean)
forecast$upper <- exp(forecast$upper)
forecast$lower <- exp(forecast$lower)
forecast$x <- exp(forecast$x)

knitr::kable(forecast) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1994 13244.70 10285.82 17054.73 8969.583 19557.43
Feb 1994 17409.81 13520.45 22418.00 11790.284 25707.73
Mar 1994 29821.65 23129.40 38450.24 20155.412 44123.68
Apr 1994 20774.16 16133.21 26750.16 14068.696 30675.62
May 1994 21783.73 16917.24 28050.15 14752.395 32166.37
Jun 1994 23162.27 17987.81 29825.24 15685.969 34201.95
Jul 1994 27831.56 21613.98 35837.72 18848.111 41096.73
Aug 1994 27818.48 21603.82 35820.87 18839.249 41077.41
Sep 1994 30848.42 23956.87 39722.43 20891.193 45551.50
Oct 1994 34095.57 26478.61 43903.67 23090.230 50346.32
Nov 1994 55176.84 42850.31 71049.28 37366.903 81475.41
Dec 1994 120067.79 93244.59 154607.08 81312.400 177294.90
Jan 1995 17250.40 13357.65 22277.59 11629.938 25587.08
Feb 1995 22675.20 17558.28 29283.31 15287.252 33633.55
Mar 1995 38840.85 30046.98 50208.44 26146.972 57697.39
Apr 1995 27057.06 20951.33 34942.16 18241.435 40133.06
May 1995 28371.96 21969.51 36640.25 19127.918 42083.42
Jun 1995 30167.42 23359.80 38958.95 20338.387 44746.58
Jul 1995 36248.88 28068.91 46812.70 24438.412 53767.06
Aug 1995 36231.84 28055.72 46790.69 24426.922 53741.78
Sep 1995 40178.16 31111.50 51887.06 27087.467 59595.26
Oct 1995 44407.37 34386.35 57348.77 29938.733 65868.34
Nov 1995 71864.42 55647.40 92807.48 48449.831 106594.69
Dec 1995 156380.86 121091.75 201954.08 105429.448 231955.81
Jan 1996 22467.57 17336.40 29117.46 15065.329 33506.86
Feb 1996 29533.04 22788.25 38274.14 19802.984 44043.89
Mar 1996 50587.81 39009.73 65602.25 33887.802 75517.62
Apr 1996 35240.15 27191.96 45670.42 23629.808 52555.15
May 1996 36952.72 28513.41 47889.88 24778.151 55109.18
Jun 1996 39291.20 30317.82 50920.48 26346.183 58596.65
Jul 1996 47211.93 36429.60 61185.57 31657.322 70409.18
Aug 1996 47189.73 36412.48 61156.80 31642.439 70376.07
Sep 1996 52329.57 40378.47 67817.91 35088.887 78041.33
Oct 1996 57837.85 44628.77 74956.52 38782.394 86256.08
Nov 1996 93598.96 72222.70 121302.09 62761.521 139588.15
Dec 1996 203676.38 157160.50 263959.89 136572.460 303751.35
plot(forecast)

j) Model improvements

I think we could have used a multiplicative seasonal model to forecast the future sales, since the seasonal swings clearly grow with the trend.

5.6 Gasoline

a) Harmonic regression

Let's first plot the entire dataset.

autoplot(gasoline) +
  labs(title = 'Supplies of US finished motor gasoline product',
       y = 'Gasoline')
Gasoline product supplied, in millions of barrels per day

Let's first try Fourier terms with K = 1 and K = 5.

# We can only use up to 2004 for our training data
gas.2004 <- window(gasoline, end = 2005)

# First order
fit.1 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 1))
# Fifth order
fit.5 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 5))

autoplot(gas.2004) +
      autolayer(fit.1$fitted.values,
                series = 'K = 1',
                size = 2) +
      autolayer(fit.5$fitted.values,
                series = 'K = 5',
                size = 1.2) + 
      ggtitle('Fourier Transform K=1 and K=5') +
      ylab("Gasoline")

We can see that the K = 5 harmonic tracks the seasonal pattern much better than the K = 1 harmonic, which is too rigid with only a single sine-cosine pair. Now let's see what improvement K = 10, 15, and 20 add to our model.

fit.10 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 10))
fit.15 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 15))
fit.20 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 20))

autoplot(gas.2004) +
      autolayer(fit.5$fitted.values,
                series = 'K = 5',
                size = 2.5) + 
        autolayer(fit.10$fitted.values,
                series = 'K = 10',
                size = 2) + 
        autolayer(fit.15$fitted.values,
                series = 'K = 15',
                size = 1) + 
        autolayer(fit.20$fitted.values,
                series = 'K = 20',
                size = 0.75) + 
      ggtitle('Fourier Transforms K=5 to K = 20') +
      ylab("Gasoline")

We can see from the overlays that each increase in K produces a better fit: the fitted seasonal amplitude shrinks slightly, but more of the within-year variation is captured.
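Under the hood, fourier() just generates sine and cosine pairs at harmonic frequencies of the seasonal period. A minimal base-R sketch of the regressors (an illustration, not fpp2's actual implementation):

```r
# Build the 2K Fourier regressors for a series of length n with period m
fourier_terms <- function(n, m, K) {
  t <- seq_len(n)
  out <- sapply(seq_len(K), function(k)
    cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m)))
  matrix(out, nrow = n)  # columns ordered S1, C1, S2, C2, ..., SK, CK
}

dim(fourier_terms(104, 52.18, 3))  # 104 rows, 6 columns
```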

b) Model selection

cvs <- as.data.frame(rbind(CV(fit.1), CV(fit.5), CV(fit.10),
                           CV(fit.15), CV(fit.20)))

rownames(cvs) <- c('K = 1', 'K = 5', 'K = 10',
                'K = 15', 'K = 20')

cvs %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            full_width = F)
CV AIC AICc BIC AdjR2
K = 1 0.0820192 -1813.292 -1813.208 -1790.354 0.8250858
K = 5 0.0755365 -1873.234 -1872.723 -1813.596 0.8406928
K = 10 0.0713575 -1915.014 -1913.441 -1809.500 0.8516102
K = 15 0.0719086 -1910.231 -1906.988 -1758.841 0.8525942
K = 20 0.0722383 -1908.017 -1902.469 -1710.753 0.8540588

From these we can see that we should choose the K = 10 model: it minimizes CV (and AICc), and the models with K above 10 start to overfit the training data.
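The CV column is leave-one-out cross-validation, which for a linear model can be computed without refitting, using the leverages (hat values). A sketch on simulated data (not the gasoline models):

```r
# Leave-one-out CV shortcut for OLS: CV = mean((e_i / (1 - h_i))^2),
# where e_i are the residuals and h_i the hat values (leverages)
set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)

m  <- lm(y ~ x)
cv <- mean((residuals(m) / (1 - hatvalues(m)))^2)
```

This equals the brute-force procedure of refitting the model 30 times with one observation held out each time, at a fraction of the cost.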

c) Residuals

Below are the residual diagnostics.

checkresiduals(fit.10)

## 
##  Breusch-Godfrey test for serial correlation of order up to 104
## 
## data:  Residuals from Linear regression model
## LM test = 155.45, df = 104, p-value = 0.0008135

The Breusch-Godfrey test is significant, so some autocorrelation remains in the residuals even for the chosen K = 10 model.

d, e) Forecast and Plot

forecast <- forecast(fit.10,
                     newdata=data.frame(fourier(gas.2004, K = 10, h = 52)))

autoplot(window(gasoline, start = 2004, end = 2006))+
  autolayer(forecast) + 
  labs(title = 'Forecasts of 2005',
       y = 'Gasoline')

We can see that our forecasting interval does a pretty good job of predicting the next year’s actual value. Below is a graph without the interval.

autoplot(window(gasoline, start = 2004, end = 2006))+
  autolayer(forecast$mean, color = 'blue') + 
  labs(title = 'Mean Gasoline Forecasts',
       y = 'Gasoline')

When we remove the interval we see that the model does not predict the large drop in mid-2005, which might be a seasonal event or an unforeseen one. Still, the drop is not too surprising: it lies within the 95% interval.

5.7 Lake Huron

a) Data exploration

data("LakeHuron")
autoplot(LakeHuron) + 
  labs(title = 'Water Level of Lake Huron over Time',
       y = 'Water Level')
Water level in Feet

Water level in Feet

We don’t see a very obvious trend in the data; there might be some seasonality or cyclicality, but it’s hard to tell, and the long-term average level of the lake appears rather constant.

b) Piecewise and linear model

# Regular linear model
lm.water <- tslm(huron ~ trend)
export_summs(summ(lm.water), output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Water Level'))
                Water Level
(Intercept)     10.20 ***  (0.23)
trend           -0.02 ***  (0.00)
N               98
R squared        0.27
P value          0.00
AIC            306.10
*** p < 0.001; ** p < 0.01; * p < 0.05.
# Piecewise
t <- time(huron)
t.break <- 1915
t.piece <- ts(pmax(0,t-t.break), start=1875)
pw.huron <- tslm(huron ~ t + t.piece)
export_summs(summ(pw.huron), output='html', 
             statistics = c("N" = "nobs", 
                            "R squared" = "r.squared", 
                            "P value" = "p.value",
                            "AIC" = "AIC"),
             model.names = c('Water Level'))
                Water Level
(Intercept)    132.91 ***  (19.98)
t               -0.06 ***   (0.01)
t.piece          0.06 ***   (0.02)
N               98
R squared        0.38
P value          0.00
AIC            291.77
*** p < 0.001; ** p < 0.01; * p < 0.05.

We can see that, given its lower AIC, the piecewise model appears to fit the data better than the simple linear trend.
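The same comparison is easy to reproduce on simulated data where the true relationship really is piecewise (a toy example, unrelated to the lake data):

```r
# When the truth is a hinge, the piecewise model earns the lower AIC
set.seed(1)
x <- 1:50
y <- pmax(0, x - 25) + rnorm(50)     # flat, then rising after the knot at 25

m_lin <- lm(y ~ x)
m_pw  <- lm(y ~ x + pmax(0, x - 25))
AIC(m_lin) > AIC(m_pw)  # TRUE: the extra hinge term more than pays for itself
```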

c) Forecasts

# 1980 is 8 years forward
# First lets fit the linear trend
forecast <- forecast(lm.water, h=8)

# Then the piecewise
t.new <- t[length(t)] + seq(8)
t.piece.new <- t.piece[length(t.piece)]+seq(8)

forecast.pw <- forecast(pw.huron,newdata =  data.frame('t' = t.new, 't.piece' = t.piece.new))

# Plotting the linear
autoplot(huron) +
  autolayer(forecast) + 
  autolayer(lm.water$fitted.values, color = 'blue') +
  labs(title = 'Water Level of Lake Huron over Time',
       y = 'Water Level')

autoplot(huron) +
  autolayer(forecast.pw) +
  autolayer(pw.huron$fitted.values, color = 'blue') +
  labs(title = 'Water Level of Lake Huron over Time',
       y = 'Water Level')

We can see that the piecewise model captures the initial drop that we see in the series, whereas the linear model fails to do so.

Chapter 6 Exercises

6.1 Moving Averages

Centered moving averages can themselves be smoothed by another moving average, creating a double moving average. A 3x5 moving average signifies a 3-period moving average of a 5-period moving average.

That is, we average three successive 5-period averages, each weighted 1/3. Expanding the three terms and collecting the coefficient on each observation:

\[ \frac{1}{3}\left(\frac{X_1+X_2+X_3+X_4+X_5}{5}\right) + \frac{1}{3}\left(\frac{X_2+X_3+X_4+X_5+X_6}{5}\right) + \frac{1}{3}\left(\frac{X_3+X_4+X_5+X_6+X_7}{5}\right) \\ = \frac{1}{15}X_1 + \frac{2}{15}X_2 + \frac{3}{15}X_3 + \frac{3}{15}X_4 + \frac{3}{15}X_5 + \frac{2}{15}X_6 + \frac{1}{15}X_7 \]

These are exactly the stated weights for the equivalent 7-period weighted MA.
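The same weights can be checked numerically: the weights of a 3x5 MA are the convolution of a 5-term and a 3-term simple moving average.

```r
# Convolve the two weight vectors to get the combined 7-term weights
w5 <- rep(1/5, 5)
w3 <- rep(1/3, 3)
w  <- as.numeric(stats::convolve(w5, w3, type = "open"))
round(w * 15)  # 1 2 3 3 3 2 1
```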

6.2 Plastics

a) Data exploration

autoplot(plastics) + 
  labs(title = 'Monthly Sales',
       y = 'Plastic Sales',
       x = 'Month')
Units are in thousands

We can see a seasonal fluctuation with an upward trend in the monthly sales. Eyeballing the plot, the seasonal swings appear to grow with the level of the series, so the seasonality looks multiplicative.

b, c) Multiplicative decomposition

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + 
  labs(x = "Month") +
  ggtitle("Monthly Product Sales Decomposition")

We can see from the graph that indeed there is a seasonal pattern in the data in addition to a general upward trend.

d) Seasonally adjusted data

autoplot(plastics) + 
  labs(title = "Monthly Sale Predictions", 
       y="Plastic Sales", x = 'Month') + 
  autolayer((plastics/decompose(plastics, type="multiplicative")$seasonal),
            series = 'Decomposed')

e) Outlier on decomposition

# For reproducibility
set.seed(42)
outlier <- sample(1:length(plastics), 1)
plastics2 <- plastics
plastics2[outlier] <- plastics[outlier] + 500

autoplot(plastics2) +
  labs(title = "Monthly Sales, Observation 19 Outlier",
       y="Plastic Sales", x = 'Month') + 
  autolayer((plastics2/decompose(plastics2, type="multiplicative")$seasonal),
            series = 'Decomposed')

We can see that the outlier passes straight through the decomposition, showing up as a spike in the seasonally adjusted series.

# Outlier placed near the end of the series
plastics3 <- plastics
plastics3[(length(plastics)-1)] <- plastics[(length(plastics)-1)] + 500

autoplot(plastics3) +
  labs(title = "Monthly Sales, End Series Outlier",
       y="Plastic Sales", x = 'Month') + 
  autolayer((plastics3/decompose(plastics3, type="multiplicative")$seasonal),
            series = 'Decomposed')

We can see that an outlier near the end can also be handled pretty well by the decomposition.

6.3 Retail series

temp = tempfile(fileext = ".xlsx")
url <- 'https://otexts.com/fpp2/extrafiles/retail.xlsx'
download.file(url, destfile=temp, mode='wb')
retail <- readxl::read_excel(temp, skip = 1)
retail.ts <- ts(retail[,17], frequency=12, start=c(1982,4))

autoplot(retail.ts, main="Monthly Australian Retail Sales", ylab="Sales", xlab="Year")

library(seasonal)
retail.x11 <- seas(retail.ts, x11="")

autoplot(retail.x11, 
         main="Monthly Austrailian Retail Sales X11 Decomposition", 
         xlab="Year")

The last series in the decomposition graph reveals outliers around 1987, 1993/4, 2003, and others in the 2010s. We were previously unable to see these in the original graphs.

6.4 Interpretation of Graphs

a) Results

The graph shows the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995. There is a strong positive trend in the graph. The decomposition shows that the growth in the labour force is cyclical, with seasonality on top. Looking at the remainder series, there are some major outliers in the early 1990s. This could be a one-time event in Australia that affected the labour force, perhaps something like a change in census accounting. The second chart measures the degree of seasonality, showing the largest seasonal increase in July and a decrease in March.

b) Recession

The 1991/2 recession is visible from the irregular series at the end, we can see major decreases around those years in the labor force for the months in that year.

6.5 Canadian Gas

a) Different plots

data(cangas)

autoplot(cangas, 
         main = 'Canadian Gas Production',
         ylab = 'Gas Production in Billions of Cubic Metres',
         xlab = 'Year')

ggsubseriesplot(cangas, 
         main = 'Canadian Gas Production Sub Series',
         ylab = 'Gas Production in Billions of Cubic Metres',
         xlab = 'Month')

ggseasonplot(cangas,
             main = 'Canadian Gas Production Seasonal Plot',
             ylab = 'Gas Production in Billions of Cubic Metres',
             xlab = 'Month')

I think the change in seasonality over the years, namely that it becomes less pronounced, might be due to two things. One is technological change that allows steadier production of gas throughout the year; the other is new gas reserves being discovered over time.

b) STL decomposition

stl(cangas, 
    t.window=13, 
    s.window="periodic", 
    robust=TRUE) %>%
  autoplot(main = 'STL Decomposition', xlab = 'Years')

c) Compare decompositions

The results of the decompositions are fairly similar. The remainder plot reveals the outliers between 1980 and 1990 that might have been affecting previous breakdowns. These outliers are smaller later on, which is consistent with the decrease in seasonality discussed in previous sections.

6.6 Australian Clay Brick Production

a) STL decomposition

stl(bricksq, 
    t.window=20, 
    s.window="periodic", 
    robust=TRUE) %>%
  autoplot()

b) Seasonally adjusted plots

autoplot(bricksq, 
         main="Clay Production", ylab="Units", xlab="Year") + 
  autolayer((bricksq-decompose(bricksq, type="additive")$seasonal),
            series = 'Decomposed')

c) Naïve method

autoplot(bricksq, 
         main="Clay Production", ylab="Units", xlab="Year") +
  autolayer(naive((bricksq-decompose(bricksq, type="additive")$seasonal), 
                  h=30), series="Naïve")

d) Reseasonalize

brick.rs <- stlf(bricksq)
autoplot(brick.rs)

e) Residuals

checkresiduals(brick.rs)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 41.128, df = 6, p-value = 2.733e-07
## 
## Model df: 2.   Total lags used: 8

The residuals look roughly normally distributed, but the highly significant Ljung-Box test (p < 0.001) indicates that they are still autocorrelated.

f) Robust STL

brick.rs <- stlf(bricksq, robust=TRUE)
autoplot(brick.rs)

checkresiduals(brick.rs)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 28.163, df = 6, p-value = 8.755e-05
## 
## Model df: 2.   Total lags used: 8

It doesn’t seem to have made much of a difference: the residuals are still significantly autocorrelated, although the Ljung-Box statistic is somewhat smaller for the robust version.

g) Naive comparison

Brick.train <- subset(bricksq, end=length(bricksq) - 9)
Brick.test <- subset(bricksq, start=length(bricksq) - 8)

brick <- snaive(Brick.train)
brick.stlf <- stlf(Brick.train, robust=TRUE)

autoplot(brick)

autoplot(brick.stlf)

It looks like the forecasts with STLF are less variable and better predict production.

6.7 Writing

autoplot(writing,
         main = 'Industry sales for printing and writing paper',
         ylab = 'Thousands of French francs',
         xlab = 'Year')

It looks like there is drift in the data.

stlf(writing, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(writing)) %>% 
autoplot()

6.8 Fancy revisited

autoplot(fancy,
         main = 'Souvenir Shop Sales', ylab = 'Sales',
         xlab = 'Year')

There is clearly a positive trend, so let's once again use the drift method.

stlf(fancy, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(fancy)) %>% 
autoplot()